Mosquitoes and monkeys: Mapping cases to investigate yellow fever virus emergence

2nd CADDE Workshop on Portable Metagenomics for Pathogen Surveillance, Sao Paulo 13-17 March 2022
Author

Dr. Sabrina Li

Published

March 15, 2023


Introduction

Yellow fever virus is blah blah blah -

-Briefly discuss history

-Briefly discuss transmission cycle

-briefly discuss latest YFV outbreak

This tutorial will teach you how to make maps in RStudio to visualisase

Mosquitoes and non-human primates (NHP)

Let’s load Horto_MO_NHP.csv into RStudio. This file contains data on confirmed cases of yellow fever virus extracted from mosquitoes and non-human primates sampled in Horto, a neighbourhood in Sao Paulo, and elsewhere. Samples were collected between 2017 and 2018.

horto<-read.csv("Horto_MO_NHP.csv")

Let’s have a look at the data. We can run View(horto) to open the data frame. A preview of the data is shown below.

ID source Location Accession_Number Strain Country State Municipality latitude longitude Host_species date
New_mosquitoes|pool119|Brazil|SP|SaoPaulo_PEAL|2018-01-05 This_study SaoPaulo_PEAL New_mosquitoes pool119 Brazil SP Sao_Paulo -23.46634 -46.63147 Haemagogus leucocelaenus 05/01/2018
New_mosquitoes|pool120|Brazil|SP|SaoPaulo_PEAL|2018-01-05 This_study SaoPaulo_PEAL New_mosquitoes pool120 Brazil SP Sao_Paulo -23.46634 -46.63147 Haemagogus leucocelaenus 05/01/2018
New_mosquitoes|pool177|Brazil|SP|SaoPaulo_PEAL|2018-01-11 This_study SaoPaulo_PEAL New_mosquitoes pool177 Brazil SP Sao_Paulo -23.46537 -46.63098 Haemagogus leucocelaenus 11/01/2018
New_mosquitoes|pool182|Brazil|SP|SaoPaulo_PEAL|2018-01-11 This_study SaoPaulo_PEAL New_mosquitoes pool182 Brazil SP Sao_Paulo -23.46537 -46.63098 Haemagogus leucocelaenus 11/01/2018
New_mosquitoes|pool192|Brazil|SP|SaoPaulo_PEAL|2018-01-11 This_study SaoPaulo_PEAL New_mosquitoes pool192 Brazil SP Sao_Paulo -23.46537 -46.63098 Haemagogus leucocelaenus 11/01/2018

What do you notice?

  • From “ID”, and “Location”, data were collected in Horto (PEAL) but also in the north of Sao Paulo.

  • In particular, we have a column called “Accession_number”, which tells us whether the virus host is a mosquito or a non-human primate, and “Host_species”, which refers to its species type.

How much data do we have on mosquitoes and on non-human primates? We can easily determine this by filtering the data to create a bar chart using ggplot2.

First, we’ll need to load ggplot2 into our library in RStudio. We will also need to load tidyverse as it will provide us functions that can help us filter our data for plotting. If you do not have these packages, you can install them by running, for instance, install.packages(“ggplot2”), and install.packages(“tidyverse”) in RStudio.

library(ggplot2)
library(tidyverse)

To create a bar chart, we need to determine the total number of observations per species and host type. We then plot this number by host type and species type. This takes a few steps, as shown in the code below.

You will notice that each step ends with % - this is a pipe from the tidyverse package that tells us the sequence of steps we are taking.

Try the code below in RStudio to produce the bar chart. Notice I’ve made comments about each step in our code using # symbol. This allows us to make notes that do not impact the analysis.

horto %>%
  # Determine the count of species for mosquitoes and non-human primates
  group_by(Accession_Number,Host_species) %>%
  summarise(count=n()) %>%
  # Reorder the data in ascending order. 
  mutate(Accession_Number = reorder(Accession_Number,count, increasing = T)) %>%
  # Create a bar chart using ggplot.
  ggplot(aes(x = Accession_Number, count, fill = Host_species)) +
  geom_col(position = "dodge")

Figure 1: Bar chart showing distribution of species by host type.

We can see that we have one species of mosquitoes, Haemagogus leucocelaenus, and one species of NHP, Alouatta.

Let’s now map these cases to understand its spatial distribution in Horto. Have a look again at the dataframe - what information here would be useful for mapping?

Mapping basics

If we want to map a data set, we will need to look for location-related information. There are five columns in our data frame that tells us information about the location of the data set.

  • Country

  • State

  • Municipality

  • Location: Place where the sample was collected.

  • Latitude: part of the coordinate system (Y-axis), exact location north or south of the equator.

  • Longitude: part of the coordinate system (X-axis), exact location east or west of prime meridian at Greenwich.

Let’s now create a map showing Sao Paulo municipality, where Horto Florestal is located.

  • The package sf and ggspatial offers spatial feature handling and mapping capabilities.

  • geobr offers maps of Brazil containing vector data on location, shape, and geographical attributes at various administrative levels.

  • In this tutorial we will be extracting municipalities, conservation areas, and census tracts, all using geobr.

Let’s start with installing packages for geobr, sf, and ggspatial, then loading the libraries into R. You can do this using the function *install.packages(” “) - give this a go yourself :)

library(geobr)
library(sf)
library(ggspatial)

# Extract Sao Paulo municipality
sp_muni <- read_municipality(code_muni = "SP", year = 2020, showProgress = FALSE) %>%
  filter(code_muni == "3550308")

# head() allows us to get an overview of our data. We can see that contains spatial information and it's a MULTIPOLYGON. 
head(sp_muni)
Simple feature collection with 1 feature and 7 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -46.82619 ymin: -24.00826 xmax: -46.36531 ymax: -23.35629
Geodetic CRS:  SIRGAS 2000
  code_muni name_muni code_state abbrev_state name_state code_region
1   3550308 São Paulo         35           SP  São Paulo           3
  name_region                           geom
1     Sudeste MULTIPOLYGON (((-46.54624 -...
# Simple feature collection with 1 feature and 7 fields
# Geometry type: MULTIPOLYGON
# Dimension:     XY
# Bounding box:  xmin: -46.82619 ymin: -24.00826 xmax: -46.36531 ymax: -23.35629
# Geodetic CRS:  SIRGAS 2000
#   code_muni name_muni code_state abbrev_state name_state code_region name_region
# 1   3550308 São Paulo         35           SP  São Paulo           3     Sudeste
#                             geom
# 1 MULTIPOLYGON (((-46.54624 -...

# Create a map showing Sao Paulo municipality
ggplot() + 
  geom_sf(data = sp_muni, fill = "#008A60", colour="#FEBF57", size=2,show.legend = FALSE) +
  labs(subtitle = "Greater metropolitan area of Sao Paulo", size=8)

Let’s highlight the Horto Florestal area on the map. Because Horto Florestal is a conservation area, we can use geobr to extract the shapefile of the park from the function read_conservation_units(). The park attached to Horto is “PARQUE ESTADUAL ALBERTO LöFGREN”.

# Using geobr, import mapping feature details on all main conservation areas in Brazil
cons_areas <- read_conservation_units(showProgress = FALSE)

# Extract administrative boundary of Horto 
horto_shp <- cons_areas %>%
  filter(name_conservation_unit == "PARQUE ESTADUAL ALBERTO LöFGREN")

# Create a map showing the greater metropolitan area of Sao Paulo with Horto Florestal highlighted
ggplot() + 
  geom_sf(data = sp_muni, fill = "#008A60", colour="#FEBF57", show.legend = FALSE) +
  geom_sf(data = horto_shp, fill = "#FEBF57", colour="#FEBF57", show.legend = FALSE)

Now let’s create a zoomed in version of our map showing Horto and its surrounding neighbourhoods. We will represent neighbourhoods at the census tracts administrative level using the data from geobr.

# Using geobr, import census tracts for Sao Paulo municipality 
census_tracts  <- read_census_tract(code_tract = 3550308, year = 2020, showProgress = FALSE)

# Crop census tract feature so we can zoom into the neighbouring areas containing Horto Florestal
# Using I've indicated the spatial boundaries for our map 
cts_crop <- st_crop(census_tracts, ymin = -23.48, ymax = -23.44,
                  xmin = -46.66, xmax = -46.62)
# Map Horto within the SP municipality boundary 
ggplot() + 
  geom_sf(data = cts_crop, aes(geometry = geom), fill = "#ede9dd", colour = "#238B45", show.legend = FALSE) +
  geom_sf(data = horto_shp, aes(geometry = geom), fill="#FEBF57", size = 10, show.legend = FALSE) 

Mapping samples collected from Horto Florestal

Let’s now create a map of Horto showing the locations of the Haemagogus samples from our Horto data set. First, we’ll need to convert our data, the dataframe, to a spatial feature for mapping. Then, we will run some sf functions to ensure our data can be mapped. We then use the subset() function to extract data on Haemagogus only. Lastly We plot the sample locations in Horto and its surrounding neighbouring areas.

# Convert data frame to a spatial feature
horto_data <- st_as_sf(horto, coords = c("longitude","latitude")) 

# Ensure it uses the same coordinate system as the SP municipality boundary 
st_crs(horto_data) <- st_crs (sp_muni)

# Create a points spatial feature with latitude and longitude coordinates attached
horto_lat_long <- cbind(horto_data, st_coordinates(horto_data))

# Extract data on Haemagogus only
horto_hg <- subset(horto_lat_long, Host_species == "Haemagogus leucocelaenus")

# Create a map showing distribution of Haemagogus. Sample locations of Haemagogus will be indicated using blue points. 
ggplot() + 
 geom_sf(data = cts_crop, aes(geometry = geom), fill = "#ede9dd", colour = "#238B45", show.legend = FALSE) +
    geom_sf(data = horto_shp, fill = "#FEBF57", colour="#FEBF57", show.legend = FALSE) + 
  geom_point (data = horto_hg, aes(x = X, y = Y, colour = Host_species), shape = 1, stroke = 1, size = 0.05) +
  scale_color_manual(values=c("#3399FF")) 

Only 9 samples were collected. What can you infer about sampling based on its spatial distribution? We can see that some samples are collected in the same location.

Can you revise the code above and create one for Alouatta? You will need to use the subset function to create a new data set for Alouatta only by extracting the data from horto_lat_long.

Below are some hints…

41 samples were collected for Alouatta. What can you infer about the spatial distribution of sampling? We can see that samples were taken within and at the borders of Horto Florestal.

Let’s now create a map showing both host species.

# Map showing Haemagogus (blue) and Alouatta (orange) sample locations
ggplot() + 
 geom_sf(data = cts_crop, aes(geometry = geom), fill = "#ede9dd", colour = "#238B45", show.legend = FALSE) +
    geom_sf(data = horto_shp, fill = "#FEBF57", colour="#FEBF57", show.legend = FALSE) + 
  geom_point (data = horto_lat_long, aes(x = X, y = Y, colour = Host_species), shape = 1, stroke = 1, size = 0.05) +
  scale_color_manual(values=c("#E56D00","#3399FF"))

Figure 2: Map showing spatial distribution of samples by host

Let’s now format our map a bit by adding a north arrow, scale bar, and remove the grid from our map. We will be using the ggspatial package to do this. Click Zoom above the figure in RStudio to see the map with updated formatting.

ggplot() + 
 geom_sf(data = cts_crop, aes(geometry = geom), fill = "#ede9dd", colour = "#238B45", show.legend = FALSE) +
    geom_sf(data = horto_shp, fill = "#FEBF57", colour="#FEBF57", show.legend = FALSE) + 
  geom_point (data = horto_lat_long, aes(x = X, y = Y, colour = Host_species), shape = 1, stroke = 1, size = 1) +
  scale_color_manual(values=c("#E56D00","#3399FF")) +
  # Add scalebar
  annotation_scale(location = "bl",
                   pad_x = unit(0.75, "cm")) + 
  # Add north arrow
  annotation_north_arrow(location = "tr", which_north = "true",
                         pad_x = unit(1, "cm"),
                         pad_y = unit(1, "cm"),
                         height = unit(0.75, "cm"),
                         width = unit(0.5, "cm")) + 
  # Reformat the map 
  theme(legend.position="bottom", 
        panel.border = element_blank(),
        axis.title.x=element_blank(), 
        axis.title.y=element_blank(),
        panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank()) + 
  theme_void()

Since samples were taken in both 2017 and 2018, let’s visualise this again by year.

We will need to using the package lubridate to extract year information from our data. First, ensure this package is installed and loaded in RStudio.

Here we are adding a new line “facet_wrap(~year)” into our ggplot code. This line will create a map for 2017 and 2018.

library(lubridate)

#convert date in our data column to a date format recognizable by R
horto_lat_long$newdate<-as.Date(horto_lat_long$date, format = "%Y-%m-%d")

#create a new column with year only
horto_lat_long$year <- year(horto_lat_long$newdate)

ggplot() + 
 geom_sf(data = cts_crop, aes(geometry = geom), fill = "#ede9dd", colour = "#238B45", show.legend = FALSE) +
    geom_sf(data = horto_shp, fill = "#FEBF57", colour="#FEBF57", show.legend = FALSE) + 
  geom_point (data = horto_lat_long, aes(x = X, y = Y, colour = Host_species), shape = 1, stroke = 1, size = 1) +
  scale_color_manual(values=c("#E56D00","#3399FF")) +
  facet_wrap(~year) + 
  annotation_scale(location = "bl",
                   pad_x = unit(0.75, "cm")) + # add scale
  annotation_north_arrow(location = "tr", which_north = "true",
                         pad_x = unit(1, "cm"),
                         pad_y = unit(1, "cm"),
                         height = unit(0.75, "cm"),
                         width = unit(0.5, "cm")) + 
  theme(legend.position="bottom", 
        panel.border = element_blank(),
        axis.title.x=element_blank(), 
        axis.title.y=element_blank(),
        panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank()) + 
  theme_void()

##Next steps

Can you think of ways that we can improve our map?